AITopics

2511.12191

Country: Europe (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)

Medlin, Karen, Leyffer, Sven, Raghavan, Krishnan

Sampling Imbalanced Data with Multi-objective Bilevel Optimization

arXiv.org Artificial IntelligenceJul-11-2025

Two-class classification problems are often characterized by an imbalance between the number of majority and minority datapoints resulting in poor classification of the minority class in particular. Traditional approaches, such as reweighting the loss function or naïve resampling, risk overfitting and subsequently fail to improve classification because they do not consider the diversity between majority and minority datasets. Such consideration is infeasible because there is no metric that can measure the impact of imbalance on the model. To obviate these challenges, we make two key contributions. First, we introduce MOODS~(Multi-Objective Optimization for Data Sampling), a novel multi-objective bilevel optimization framework that guides both synthetic oversampling and majority undersampling. Second, we introduce a validation metric -- `$ε/ δ$ non-overlapping diversification metric' -- that quantifies the goodness of a sampling method towards model performance. With this metric we experimentally demonstrate state-of-the-art performance with improvement in diversity driving a $1-15 \%$ increase in $F1$ scores.

artificial intelligence, machine learning, training data, (16 more...)

2506.11315

Country:

Europe (0.46)
North America > United States (0.28)

Genre: Research Report (1.00)

Industry:

Law Enforcement & Public Safety > Fraud (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)

Raslan, Khaled SH., Alsharkawy, Almohammady S., Raslan, K. R.

iHHO-SMOTe: A Cleansed Approach for Handling Outliers and Reducing Noise to Improve Imbalanced Data Classification

arXiv.org Artificial IntelligenceApr-18-2025

Classifying imbalanced datasets remains a significant challenge in machine learning, particularly with big data where instances are unevenly distributed among classes, leading to class imbalance issues that impact classifier performance. While Synthetic Minority Over-sampling Technique (SMOTE) addresses this challenge by generating new instances for the under-represented minority class, it faces obstacles in the form of noise and outliers during the creation of new samples. In this paper, a proposed approach, iHHO-SMOTe, which addresses the limitations of SMOTE by first cleansing the data from noise points. This process involves employing feature selection using a random forest to identify the most valuable features, followed by applying the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to detect outliers based on the selected features. The identified outliers from the minority classes are then removed, creating a refined dataset for subsequent oversampling using the hybrid approach called iHHO-SMOTe. The comprehensive experiments across diverse datasets demonstrate the exceptional performance of the proposed model, with an AUC score exceeding 0.99, a high G-means score of 0.99 highlighting its robustness, and an outstanding F1-score consistently exceeding 0.967. These findings collectively establish Cleansed iHHO-SMOTe as a formidable contender in addressing imbalanced datasets, focusing on noise reduction and outlier handling for improved classification models.

artificial intelligence, data mining, machine learning, (5 more...)

2504.1285

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.53)

Medlin, Karen, Leyffer, Sven, Raghavan, Krishnan

A Bilevel Optimization Framework for Imbalanced Data Classification

arXiv.org Machine LearningDec-16-2024

Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data. To tackle unresolved problems related to both oversampling and undersampling, we propose a new undersampling approach that: (i) avoids the pitfalls of noise and overlap caused by synthetic data and (ii) avoids the pitfall of under-fitting caused by random undersampling. Instead of undersampling majority data randomly, our method undersamples datapoints based on their ability to improve model loss. Using improved model loss as a proxy measurement for classification performance, our technique assesses a datapoint's impact on loss and rejects those unable to improve it. In so doing, our approach rejects majority datapoints redundant to datapoints already accepted and, thereby, finds an optimal subset of majority training data for classification. The accept/reject component of our algorithm is motivated by a bilevel optimization problem uniquely formulated to identify the optimal training set we seek. Experimental results show our proposed technique with F1 scores up to 10% higher than state-of-the-art methods.

artificial intelligence, datapoint, machine learning, (17 more...)

2410.11171

Country:

North America > United States > North Carolina (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > South Korea > Gyeongsangnam-do > Changwon (0.04)

Genre:

Research Report > Promising Solution (0.48)
Research Report > New Finding (0.48)

Industry:

Law Enforcement & Public Safety > Fraud (0.47)
Government > Regional Government (0.47)
Health & Medicine (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.72)

#artificialintelligenceJan-26-2022, 11:20:13 GMT

Adaptive fusion based method for imbalanced data classification

The imbalance problem is widespread in real-world applications. When training a classifier on the imbalance datasets, the classifier is hard to learn an appropriate decision boundary, which causes unsatisfying classification performance. To deal with the imbalance problem, various ensemble algorithms are proposed. However, conventional ensemble algorithms do not consider exploring an effective feature subspace to further improve the performance. In addition, they treat the base classifiers equally, and ignore the different contribution of each base classifier to the ensemble result. In order to address these problems, we propose a novel ensemble algorithm that combines effective data transform and adaptive fusion scheme. First, we utilize modified metric learning to obtain an effective feature space based on imbalanced data. Next, the base classifiers are assigned different weights adaptively. The experiments on multiple imbalanced datasets, including images and biomedical dataset, verify the superiority of our proposed ensemble algorithm.

classifier, ensemble algorithm, imbalanced data classification, (4 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Kraiem, Mohamed S., Sánchez-Hernández, Fernando, Moreno-García, María N.

Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties

arXiv.org Artificial IntelligenceDec-15-2021

In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Thus, the prediction model is unreliable although the overall model accuracy can be acceptable. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class. However, their effectiveness depends on several factors mainly related to data intrinsic characteristics, such as imbalance ratio, dataset size and dimensionality, overlapping between classes or borderline examples. In this work, the impact of these factors is analyzed through a comprehensive comparative study involving 40 datasets from different application areas. The objective is to obtain models for automatic selection of the best resampling strategy for any dataset based on its characteristics. These models allow us to check several factors simultaneously considering a wide range of values since they are induced from very varied datasets that cover a broad spectrum of conditions. This differs from most studies that focus on the individual analysis of the characteristics or cover a small range of values. In addition, the study encompasses both basic and advanced resampling strategies that are evaluated by means of eight different performance metrics, including new measures specifically designed for imbalanced data classification. The general nature of the proposal allows the choice of the most appropriate method regardless of the domain, avoiding the search for special purpose techniques that could be valid for the target data.

classification, dataset, imbalance ratio, (15 more...)

doi: 10.3390/app11188546

2201.07932

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Heidelberg (0.04)
North America > United States > District of Columbia > Washington (0.04)
(16 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.68)

Industry:

Information Technology > Security & Privacy (0.87)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.93)

arXiv.org Machine LearningApr-17-2021

Potential Anchoring for imbalanced data classification

Koziarski, Michał

Data imbalance remains one of the factors negatively affecting the performance of contemporary machine learning algorithms. One of the most common approaches to reducing the negative impact of data imbalance is preprocessing the original dataset with data-level strategies. In this paper we propose a unified framework for imbalanced data over- and undersampling. The proposed approach utilizes radial basis functions to preserve the original shape of the underlying class distributions during the resampling process. This is done by optimizing the positions of generated synthetic observations with respect to the potential resemblance loss. The final Potential Anchoring algorithm combines over- and undersampling within the proposed framework. The results of the experiments conducted on 60 imbalanced datasets show outperformance of Potential Anchoring over state-of-the-art resampling algorithms, including previously proposed methods that utilize radial basis functions to model class potential. Furthermore, the results of the analysis based on the proposed data complexity index show that Potential Anchoring is particularly well suited for handling naturally complex (i.e. not affected by the presence of noise) datasets.

algorithm, dataset, imbalanced data, (15 more...)

2104.08548

Country:

South America > Uruguay > Maldonado > Maldonado (0.04)
North America > United States (0.04)
Europe > Poland > Lesser Poland Province > Kraków (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Mansourifar, Hadi, Shi, Weidong

Towards Stable Imbalanced Data Classification via Virtual Big Data Projection

arXiv.org Machine LearningAug-23-2020

Virtual Big Data (VBD) proved to be effective to alleviate mode collapse and vanishing generator gradient as two major problems of Generative Adversarial Neural Networks (GANs) very recently. In this paper, we investigate the capability of VBD to address two other major challenges in Machine Learning including deep autoencoder training and imbalanced data classification. First, we prove that, VBD can significantly decrease the validation loss of autoencoders via providing them a huge diversified training data which is the key to reach better generalization to minimize the over-fitting problem. Second, we use the VBD to propose the first projection-based method called cross-concatenation to balance the skewed class distributions without over-sampling. We prove that, cross-concatenation can solve uncertainty problem of data driven methods for imbalanced classification.

artificial intelligence, autoencoder, machine learning, (15 more...)

2009.08387

Country:

North America > United States > Texas > Harris County > Houston (0.04)
South America > Uruguay > Maldonado > Maldonado (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)

Genre: Research Report > New Finding (0.94)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

Ratadiya, Pratik, Moorthy, Rahul

Spam filtering on forums: A synthetic oversampling based approach for imbalanced data classification

arXiv.org Machine LearningSep-10-2019

Forums play an important role in providing a platform for community interaction. The introduction of irrelevant content or spam by individuals for commercial and social gains tends to degrade the professional experience presented to the forum users. Automated moderation of the relevancy of posted content is desired. Machine learning is used for text classification and finds applications in spam email detection, fraudulent transaction detection etc. The balance of classes in training data is essential in the case of classification algorithms to make the learning efficient and accurate. However, in the case of forums, the spam content is sparse compared to the relevant content giving rise to a bias towards the latter while training. A model trained on such biased data will fail to classify a spam sample. An approach based on Synthetic Minority Over-sampling Technique(SMOTE) is presented in this paper to tackle imbalanced training data. It involves synthetically creating new minority class samples from the existing ones until balance in data is achieved. The enhanced data is then passed through various classifiers for which the performance is recorded. The results were analyzed on the data of forums of Spoken Tutorial, IIT Bombay over standard performance metrics and revealed that models trained after Synthetic Minority oversampling outperform the ones trained on imbalanced data by substantial margins. An empirical comparison of the results obtained by both SMOTE and without SMOTE for various supervised classification algorithms have been presented in this paper. Synthetic oversampling proves to be a critical technique for achieving uniform class distribution which in turn yields commendable results in text classification. The presented approach can be further extended to content categorization on educational websites thus helping to improve the overall digital learning experience.

machine learning, natural language, text classification, (17 more...)

1909.04826

Country: Asia > India (0.15)

Genre:

Research Report (0.51)
Instructional Material (0.47)

Industry:

Education (0.87)
Law Enforcement & Public Safety > Fraud (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.98)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.69)